INTERSPEECH.2013 - Speech Recognition

Total: 116

#1 Lattice-based training of bottleneck feature extraction neural networks

Author: Matthias Paulik

This paper investigates a method for training bottleneck (BN) features in a more targeted manner for their intended use in GMM-HMM based ASR. Our approach adds a GMM acoustic model activation layer to a standard BN feature extraction (FE) neural network and performs lattice-based MMI training on the resulting network. After training, the network is reverted to a working BN FE network by removing the GMM activation layer, and we then train a GMM system on top of the bottleneck features in the normal way. Our results show that this approach can significantly improve recognition accuracy when compared to a baseline system that uses standard BN features. Further, we show that our approach can be used to perform unsupervised speaker adaptation, yielding significantly improved results compared to global cMLLR adaptation.
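
The abstract describes the architecture only at a high level; below is a minimal structural sketch (not the authors' implementation) of the append-then-strip idea in PyTorch. All class names, layer sizes, and the diagonal-covariance parameterization are illustrative assumptions, and the lattice-based MMI objective itself is elided.

```python
# Hypothetical sketch: append a GMM log-likelihood layer to a bottleneck (BN)
# network, train the joint network, then strip the layer to recover a BN
# feature extractor. Dimensions and names are illustrative, not the paper's.
import math
import torch
import torch.nn as nn

class BottleneckNet(nn.Module):
    def __init__(self, feat_dim=440, hidden=1024, bn_dim=40):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
            nn.Linear(hidden, bn_dim),             # bottleneck output
        )

    def forward(self, x):
        return self.body(x)

class DiagGMMLayer(nn.Module):
    """Per-state diagonal-covariance GMM log-likelihoods as activations."""
    def __init__(self, bn_dim, n_states, n_mix):
        super().__init__()
        self.means = nn.Parameter(torch.randn(n_states, n_mix, bn_dim))
        self.log_vars = nn.Parameter(torch.zeros(n_states, n_mix, bn_dim))
        self.log_wts = nn.Parameter(torch.zeros(n_states, n_mix))  # unnormalized

    def forward(self, z):                           # z: (T, bn_dim)
        diff = z[:, None, None, :] - self.means     # (T, S, M, D)
        ll = -0.5 * ((diff ** 2) / self.log_vars.exp() + self.log_vars
                     + math.log(2.0 * math.pi)).sum(-1)      # (T, S, M)
        return torch.logsumexp(ll + self.log_wts, dim=-1)    # (T, S)

bn_net = BottleneckNet()
joint = nn.Sequential(bn_net, DiagGMMLayer(40, n_states=3000, n_mix=8))
# ... lattice-based MMI training of `joint` would go here (elided) ...
extractor = bn_net  # GMM layer removed; BN features feed a GMM system as usual
```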

#2 Modular combination of deep neural networks for acoustic modeling

Authors: Jonas Gehring ; Wonkyum Lee ; Kevin Kilgour ; Ian Lane ; Yajie Miao ; Alex Waibel

In this work, we propose a modular combination of two popular applications of neural networks to large-vocabulary continuous speech recognition. First, a deep neural network is trained to extract bottleneck features from frames of mel scale filterbank coefficients. As is usually done for GMM/HMM systems, this network is then applied as a non-linear discriminative feature-space transformation for a hybrid setup where acoustic modeling is performed by a deep belief network. This effectively results in a very large network, where the layers of the bottleneck network are fixed and applied to successive windows of feature frames in a time-delay fashion. We show that bottleneck features improve the recognition performance of DBN/HMM hybrids, and that the modular combination enables the acoustic model to benefit from a larger temporal context. Our architecture is evaluated on a recently released and challenging Tagalog corpus containing conversational telephone speech.

#3 Informative spectro-temporal bottleneck features for noise-robust speech recognition

Authors: Shuo-Yiin Chang ; Nelson Morgan

Spectro-temporal Gabor features based on auditory knowledge have improved word accuracy for automatic speech recognition in the presence of noise. In previous work, we generated robust spectro-temporal features that incorporated the power normalized cepstral coefficient (PNCC) algorithm. The corresponding power normalized spectrum (PNS) is then processed by many Gabor filters, yielding a high-dimensional feature vector. In tandem processing, an MLP with one hidden layer is often employed to learn discriminative transformations from front-end features, in this case Gabor-filtered power spectra, to probabilistic features; we refer to this configuration as PNS-Gabor MLP. Here we improve PNS-Gabor MLP in two ways. First, we select informative Gabor features using sparse principal component analysis (sparse PCA) before tandem processing. Second, we use a deep neural network (DNN) with a bottleneck structure. Experiments show that the high-dimensional Gabor features are redundant. In our experiments, sparse principal component analysis suggests that Gabor filters with longer time scales are particularly informative. The best of our experimental modifications gave an error rate reduction of 15.5% relative to PNS-Gabor MLP plus MFCC, and 41.4% better than an MFCC baseline, on a large vocabulary continuous speech recognition task using noisy data.
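
As a concrete (and heavily simplified) illustration of the first modification, the sketch below uses scikit-learn's SparsePCA to find Gabor dimensions with non-zero loadings; `gabor_feats` and all sizes are placeholder assumptions, not the paper's setup.

```python
# Hedged sketch: select informative Gabor feature dimensions via sparse PCA.
import numpy as np
from sklearn.decomposition import SparsePCA

gabor_feats = np.random.randn(5000, 600)        # placeholder (frames x dims)
spca = SparsePCA(n_components=50, alpha=1.0, random_state=0)
spca.fit(gabor_feats)

# Dimensions with a non-zero loading in any sparse component are treated as
# "informative"; the rest are redundant and dropped before tandem training.
support = np.any(spca.components_ != 0, axis=0)
selected = np.where(support)[0]
print(f"kept {selected.size} of {gabor_feats.shape[1]} Gabor dimensions")
```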

#4 A scalable approach to using DNN-derived features in GMM-HMM based acoustic modeling for LVCSR

Authors: Zhi-Jie Yan ; Qiang Huo ; Jian Xu

We present a new scalable approach to using deep neural network (DNN) derived features in Gaussian mixture density hidden Markov model (GMM-HMM) based acoustic modeling for large vocabulary continuous speech recognition (LVCSR). The DNN-based feature extractor is trained from a subset of training data to mitigate the scalability issue of DNN training, while GMM-HMMs are trained by using state-of-the-art scalable training methods and tools to leverage the whole training set. In a benchmark evaluation, we used 309-hour Switchboard-I (SWB) training data to train a DNN first, which achieves a word error rate (WER) of 15.4% on NIST-2000 Hub5 evaluation set by a traditional DNN-HMM based approach. When the same DNN is used as a feature extractor and 2,000-hour "SWB+Fisher" training data is used to train the GMM-HMMs, our DNN-GMM-HMM approach achieves a WER of 13.8%. If per-conversation-side based unsupervised adaptation is performed, a WER of 13.1% can be achieved.

#5 Improved feature processing for deep neural networks

Authors: Shakti P. Rath ; Daniel Povey ; Karel Veselý ; Jan Černocký

In this paper, we investigate alternative ways of processing MFCC-based features to use as the input to Deep Neural Networks (DNNs). Our baseline is a conventional feature pipeline that involves splicing the 13-dimensional front-end MFCCs across 9 frames, followed by applying LDA to reduce the dimension to 40 and then further decorrelation using MLLT. Confirming the results of other groups, we show that speaker adaptation applied on top of these features using feature-space MLLR is helpful. The fact that the number of parameters of a DNN is not strongly sensitive to the input feature dimension (unlike GMM-based systems) motivated us to investigate ways to increase the dimension of the features. In this paper, we investigate several approaches to derive higher-dimensional features and verify their performance with DNNs. Our best result is obtained by splicing our baseline 40-dimensional speaker-adapted features again across 9 frames, followed by reducing the dimension to 200 or 300 using another LDA. Our final result is about 3% absolute better than our best GMM system, which is a discriminatively trained model.
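
A minimal sketch of the splice-then-reduce pipeline is given below, using scikit-learn's LDA as a stand-in for the Kaldi-style LDA+MLLT estimation described above; the data, labels, and sizes are placeholders.

```python
# Hedged sketch of the splice -> LDA -> splice -> LDA feature pipeline.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def splice(feats, context=4):
    """Stack each frame with +/-context neighbours (9 frames for context=4)."""
    T, D = feats.shape
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

mfcc = np.random.randn(10000, 13)              # placeholder 13-dim MFCCs
y = np.random.randint(0, 40, size=10000)       # placeholder per-frame labels

x1 = splice(mfcc)                              # 13 * 9 = 117 dims
lda1 = LinearDiscriminantAnalysis(n_components=39).fit(x1, y)
base40 = lda1.transform(x1)                    # ~40-dim base features

x2 = splice(base40)                            # 40 * 9 = 360 dims
# A second LDA would reduce this to 200-300 dims for the DNN input, as in the
# paper; note sklearn's LDA caps n_components at n_classes - 1, so a larger
# label set (e.g. context-dependent states) is needed for 200-300 dims.
```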

#6 Deep vs. wide: depth on a budget for robust speech recognition

Authors: Oriol Vinyals ; Nelson Morgan

It has now been established that incorporating neural networks can be useful for speech recognition, and that machine learning methods can make it practical to incorporate a larger number of hidden layers in a "deep" structure. Here we incorporate the constraint of freezing the number of parameters for a given task, which in many applications corresponds to practical limitations on storage or computation. Given this constraint, we vary the size of each hidden layer as we change the number of layers so as to keep the total number of parameters constant. In this way we have determined, for a common task of noisy speech recognition (Aurora2), that a large number of layers is not always optimum; for each noise level there is an optimum number of layers. We also use state-of-the-art optimization algorithms to further understand the effect of initialization and convergence properties of such networks, and to have an efficient implementation that allows us to run more experiments with a standard desktop machine with a single GPU.
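
The budget constraint itself is simple arithmetic: ignoring biases, a network with L hidden layers of width h has roughly d_in·h + (L−1)·h² + h·d_out weights, so h can be solved from a quadratic for each depth. A small sketch (illustrative sizes, not the paper's):

```python
# Solve for the hidden width h that keeps the total weight count fixed:
#   P ~= d_in*h + (L-1)*h^2 + h*d_out   (biases neglected)
import math

def width_for_budget(budget, d_in, d_out, n_layers):
    if n_layers == 1:
        return budget // (d_in + d_out)
    a = n_layers - 1                     # coefficient of h^2
    b = d_in + d_out                     # coefficient of h
    h = (-b + math.sqrt(b * b + 4 * a * budget)) / (2 * a)
    return int(h)

for L in (1, 2, 3, 4, 5):
    print(L, width_for_budget(budget=2_000_000, d_in=351, d_out=56, n_layers=L))
```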

#7 Improving LVCSR with hidden conditional random fields for grapheme-to-phoneme conversion

Authors: Stefan Hahn ; Patrick Lehnen ; Simon Wiesler ; Ralf Schlüter ; Hermann Ney

In virtually every state-of-the-art large vocabulary continuous speech recognition (LVCSR) system, grapheme-to-phoneme (G2P) conversion is applied to generalize beyond the fixed set of words given by a background lexicon. The overall performance of the G2P system has a strong effect on recognition quality. Typically, generative models based on joint n-grams are used; some discriminative models achieve competitive performance, but their training time can be quite large. In this work, the effect of discriminative G2P modeling based on hidden conditional random fields (HCRFs) is analyzed. Besides measuring and comparing G2P quality on a textual level, one focus is the performance of LVCSR systems. Although the HCRF model does not outperform the generative one on text data, we could improve our English QUAERO ASR system by 1.3% relative on a couple of test corpora over a strong baseline by replacing only the G2P strategy.

#8 Context-dependent phone mapping for LVCSR of under-resourced languages

Authors: Van Hai Do ; Xiong Xiao ; Eng Siong Chng ; Haizhou Li

This paper presents a context-dependent phone mapping approach to acoustic modeling for large vocabulary speech recognition of under-resourced languages that leverages well-trained models of other languages. Generally speaking, phone mapping can be considered a hybrid HMM/MLP (Hidden Markov Model / Multilayer Perceptron) model where the input of the MLP consists of phone acoustic scores, e.g. likelihood or posterior scores. In this paper, we use deep neural networks trained on a large amount of Malay data to generate bottleneck and posterior features for the target English acoustic models. We extend the concept of phone mapping by using not only posterior but also bottleneck features as the input for phone mapping. Experiments show that the phone mapping technique significantly outperforms the cross-lingual tandem approach. In addition, we show that bottleneck and posterior features contain complementary information: a consistent improvement is obtained by combining the two feature streams to form the input for phone mapping.
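
A toy sketch of the feature-stream combination is shown below, with scikit-learn's MLPClassifier standing in for the phone-mapping MLP; all array names, dimensions, and label sets are hypothetical placeholders.

```python
# Hedged sketch: concatenate source-language posterior and bottleneck streams
# as the input of a phone-mapping MLP that predicts target-language states.
import numpy as np
from sklearn.neural_network import MLPClassifier

malay_post = np.random.rand(8000, 120)     # source-language posteriors per frame
malay_bn = np.random.randn(8000, 40)       # source-language bottleneck features
target_states = np.random.randint(0, 500, size=8000)  # target-language labels

# Log-posteriors are a common input representation for phone mapping.
x = np.hstack([np.log(malay_post + 1e-10), malay_bn])
mapper = MLPClassifier(hidden_layer_sizes=(500,), max_iter=20).fit(x, target_states)
target_post = mapper.predict_proba(x)      # scores for hybrid HMM/MLP decoding
```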

#9 Improving grapheme-based ASR by probabilistic lexical modeling approach

Authors: Ramya Rasipuram ; Mathew Magimai-Doss

There is growing interest in using graphemes as subword units, especially in the context of the rapid development of hidden Markov model (HMM) based automatic speech recognition (ASR) systems, as it eliminates the need to build a phoneme pronunciation lexicon. However, directly modeling the relationship between acoustic feature observations and grapheme states may not always be trivial; it usually depends upon the grapheme-to-phoneme relationship within the language. This paper builds upon our recent interpretation of the Kullback-Leibler divergence based HMM (KL-HMM) as a probabilistic lexical modeling approach to propose a novel grapheme-based ASR approach in which a set of acoustic units is first derived by modeling context-dependent graphemes in the framework of a conventional HMM/Gaussian mixture model (HMM/GMM) system, and the probabilistic relationship between the derived acoustic units and the lexical units representing graphemes is then modeled in the framework of KL-HMM. Through experimental studies on English, where the grapheme-to-phoneme relationship is irregular, we show that the proposed grapheme-based ASR approach (without using any phoneme information) can achieve performance comparable to a standard phoneme-based ASR approach.
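
For intuition, the KL-HMM local score can be sketched in a few lines: each lexical state holds a categorical distribution over the derived acoustic units, and the emission cost for a frame is the KL divergence between that distribution and the frame's acoustic-unit posterior. This toy version picks one direction of the (asymmetric) divergence; the framework admits several variants.

```python
# Toy sketch of a KL-HMM local (emission) score.
import numpy as np

def kl_local_score(y_d, z_t, eps=1e-12):
    """KL(y_d || z_t): cost of emitting posterior z_t from lexical state d."""
    return float(np.sum(y_d * np.log((y_d + eps) / (z_t + eps))))

y_d = np.array([0.7, 0.2, 0.1])   # state's categorical parameters
z_t = np.array([0.6, 0.3, 0.1])   # frame posterior over acoustic units
print(kl_local_score(y_d, z_t))   # small value -> good match
```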

#10 Crosslingual tandem-SGMM: exploiting out-of-language data for acoustic model and feature level adaptation

Authors: Petr Motlicek ; David Imseng ; Philip N. Garner

Recent studies have shown that speech recognizers may benefit from data in languages other than the target language through efficient acoustic model- or feature-level adaptation. Crosslingual Tandem-Subspace Gaussian Mixture Models (SGMMs) successfully combine acoustic model- and feature-level adaptation techniques. More specifically, we focus on under-resourced languages (Afrikaans in our case) and perform feature-level adaptation through the estimation of phone class posterior features with a Multilayer Perceptron trained on data from a similar language with large amounts of available speech data (Dutch in our case). The same Dutch data can also be exploited at the acoustic model level by training globally-shared SGMM parameters in a crosslingual way. The two adaptation techniques are indeed complementary and result in a crosslingual Tandem-SGMM system that yields a relative improvement of about 22% over a standard speech recognizer on an Afrikaans phoneme recognition task. Interestingly, a final score-level combination of the individual SGMM systems yields an additional 3% relative improvement.

#11 Multilingual multilayer perceptron for rapid language adaptation between and across language families

Authors: Ngoc Thang Vu ; Tanja Schultz

In this paper, we present our latest investigations of multilingual Multilayer Perceptrons (MLPs) for rapid language adaptation between and across language families. We explore the impact of the number of languages and the amount of data used for the multilingual MLP training process. We show that the overall system performance on the target language is significantly improved by initializing it with a multilingual MLP. Our experiments indicate that the more languages we use to train a multilingual MLP, the better the initialization for MLP training. As a result, ASR performance is improved even when the target language and the source languages are not in the same language family. Our best results show an error rate improvement of up to 22.9% relative for different target languages (Czech, Hausa and Vietnamese) by using a multilingual MLP trained with many different languages from the GlobalPhone corpus. When only very little training or adaptation data is available, an improvement of up to 24% relative in terms of error rate is observed.

#12 Modeling prosodic sequences with k-means and Dirichlet process GMMs

Author: Andrew Rosenberg

In this paper we describe two unsupervised representations of prosodic sequences based on k-means and Dirichlet Process Gaussian Mixture Model (DPGMM) clustering. The clustering algorithms are used to infer an inventory of prosodic categories over automatically segmented syllables. A tri-gram model is trained over these sequences to characterize speech. We find that DPGMM clusters show a greater correspondence with manual ToBI labels than k-means clusters. However, sequence models trained on k-means clusters significantly outperform DPGMM sequences in classifying speaking style, nativeness and speakers. We also investigate the use of these sequence models in the detection of outliers regarding these three tasks. Non-parametric Bayesian techniques have the advantage of being able to learn a clustering solution and infer the number of clusters directly from data. While it is attractive to avoid specifying k before clustering, on the tasks of characterizing prosodic sequences we find that effective use of DPGMMs still requires a significant amount of parameter tuning, and performance fails to reach the level of k-means.
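
The two clustering routes map directly onto standard tooling; below is a hedged sketch using scikit-learn, where BayesianGaussianMixture with a Dirichlet-process weight prior plays the DPGMM role. The prosodic feature matrix and all hyperparameters are placeholders, not the paper's configuration.

```python
# Hedged sketch: k-means vs. (truncated) Dirichlet-process GMM clustering of
# per-syllable prosodic feature vectors.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import BayesianGaussianMixture

syllable_feats = np.random.randn(2000, 6)       # placeholder prosodic features

km_labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(syllable_feats)

# Truncated DP-GMM: the number of effective components is inferred from data;
# weight_concentration_prior plays the role of the DP concentration parameter.
dpgmm = BayesianGaussianMixture(
    n_components=30, weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.1, max_iter=500, random_state=0)
dp_labels = dpgmm.fit_predict(syllable_feats)
print("effective DPGMM clusters:", np.unique(dp_labels).size)
```

Either label sequence can then feed the tri-gram sequence model described above.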

#13 Comparing computation in Gaussian mixture and neural network based large-vocabulary speech recognition

Authors: Vishwa Gupta ; Gilles Boulianne

In this paper we look at real-time computing issues in large vocabulary speech recognition. We use the French broadcast audio transcription task from ETAPE 2011 for this evaluation. We compare word error rate (WER) versus overall computing time for hidden Markov models with Gaussian mixtures (GMM-HMM) and deep neural networks (DNN-HMM). We show that for similar computation during recognition, the DNN-HMM combination is superior to the GMM-HMM. For a real-time computing scenario, the error rate on the ETAPE dev set is 23.5% for the DNN-HMM versus 27.9% for the GMM-HMM: a significant difference in accuracy for comparable computation. Rescoring lattices (generated by the DNN-HMM acoustic model) with a quadgram language model (LM), and then with a neural net LM, reduces the WER to 22.0% while still running in real time.

#14 Simultaneous perturbation stochastic approximation for automatic speech recognition

Authors: Daniel Stein ; Jochen Schwenninger ; Michael Stadtschnitzer

While both the acoustic model and the language model in automatic speech recognition are typically well-trained on the target domain, the free parameters of the decoder itself are often set manually. In this paper, we investigate to what extent a stochastic approximation algorithm can be employed to automatically determine the best parameters, especially when additional time constraints are imposed on unknown machine architectures. We report our findings on the German Difficult Speech Corpus, and present significant improvements on both the spontaneous and the planned clean speech tasks.
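
SPSA itself is compact enough to sketch in full: each iteration perturbs all decoder parameters simultaneously along a random ±1 direction and estimates the gradient from just two objective evaluations, regardless of dimensionality. The sketch below is generic textbook SPSA with standard gain schedules, not the paper's exact configuration; `evaluate_wer` is a hypothetical black box that would decode a dev set and return WER (here replaced by a dummy quadratic).

```python
# Textbook SPSA for tuning free decoder parameters (e.g. LM scale, word
# insertion penalty, beam width) against a black-box WER objective.
import numpy as np

def spsa(theta, evaluate_wer, n_iter=50, a=0.2, c=0.1, alpha=0.602, gamma=0.101):
    theta = np.asarray(theta, dtype=float)
    for k in range(1, n_iter + 1):
        ak, ck = a / k ** alpha, c / k ** gamma
        delta = np.random.choice([-1.0, 1.0], size=theta.shape)  # Rademacher
        g_hat = (evaluate_wer(theta + ck * delta)
                 - evaluate_wer(theta - ck * delta)) / (2 * ck * delta)
        theta -= ak * g_hat        # only 2 objective evaluations per step
    return theta

# Dummy objective standing in for a real decode-and-score run:
best = spsa([12.0, 0.0, 13.0], evaluate_wer=lambda t: float(np.sum((t - 1) ** 2)))
```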

#15 Hardware/software codesign for mobile speech recognition

Authors: David Sheffield ; Michael Anderson ; Yunsup Lee ; Kurt Keutzer

In this paper, we explore high performance software and hardware implementations of an automatic speech recognition system that can run locally on a mobile device. We automate the generation of key components of our speech recognition system using Three Fingered Jack, a tool for hardware/software codesign that maps computation to CPUs, data parallel processors, and custom hardware. We use Three Fingered Jack to explore energy and performance for two key kernels in our speech recognizer, the observation probability evaluation and across-word traversal. Through detailed hardware simulation and measurement, we produce accurate estimates for energy and area and show a significant energy improvement over a conventional mobile CPU.

#16 Exploiting the succeeding words in recurrent neural network language models

Authors: Yangyang Shi ; Martha Larson ; Pascal Wiggers ; Catholijn M. Jonker

In automatic speech recognition, conventional language models recognize the current word using only information from preceding words. Recently, Recurrent Neural Network Language Models (RNNLMs) have drawn increased research attention because of their ability to outperform conventional n-gram language models. The superiority of RNNLMs is based on their ability to capture long-distance word dependencies. RNNLMs are, in practice, applied in an N-best rescoring framework, which offers new possibilities for information integration. In particular, it becomes interesting to extend the ability of RNNLMs to capture long-distance information by also allowing them to exploit information from succeeding words during the rescoring process. This paper proposes three approaches for exploiting succeeding word information in RNNLMs. The first is a forward-backward model that combines RNNLMs exploiting preceding and succeeding words. The second is an extension of a Maximum Entropy RNNLM (RNNME) that incorporates succeeding word information. The third is an approach that combines language models using two-pass alternating rescoring. Experimental results demonstrate the ability of succeeding word information to improve RNNLM performance, both in terms of perplexity and Word Error Rate (WER). The best performance is achieved by a combined model that exploits the three words succeeding the current word.
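
Of the three approaches, the forward-backward combination is the easiest to sketch: in N-best rescoring, each hypothesis receives a log-linear mixture of a forward RNNLM score and a backward RNNLM score computed over the reversed word sequence. The sketch below is schematic; `score_fwd`/`score_bwd` are hypothetical stand-ins for trained models, and the weights are illustrative, not the paper's.

```python
# Schematic forward-backward N-best rescoring.
def rescore_nbest(nbest, score_fwd, score_bwd, lm_wt=0.7, fb_mix=0.5):
    """nbest: list of (word_list, acoustic_logp) pairs."""
    rescored = []
    for words, acoustic_logp in nbest:
        lm_logp = (fb_mix * score_fwd(words)
                   + (1.0 - fb_mix) * score_bwd(list(reversed(words))))
        rescored.append((acoustic_logp + lm_wt * lm_logp, words))
    return max(rescored)[0:2][1]   # hypothesis with the best combined score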

#17 Speech acoustic unit segmentation using hierarchical Dirichlet processes

Authors: Amir Hossein Harati Nejad Torbati ; Joseph Picone ; Marc Sobel

Speech recognition systems have historically used context-dependent phones as acoustic units because these units allow linguistic information, such as a pronunciation lexicon, to be leveraged. However, when dealing with a new language for which minimal linguistic resources exist, it is desirable to automatically discover acoustic units. The process of discovering acoustic units usually consists of two stages: segmentation and clustering. In this paper, we focus on the segmentation portion of this problem. We introduce a nonparametric Bayesian approach for segmentation, based on Hierarchical Dirichlet Processes (HDP), in which a hidden Markov model (HMM) with an unbounded number of states is used to segment the utterance. This model is referred to as an HDP-HMM. We compare this algorithm to several popular heuristic methods and demonstrate an 11% improvement in finding boundaries on the TIMIT Corpus. A self-similarity measure over segments shows an 88% improvement compared to manual segmentation with comparable segment length. This work represents the first step in the development of a speech recognition system that is entirely based on nonparametric Bayesian models.

#18 Transducer-based speech recognition with dynamic language models

Authors: Munir Georges ; Stephan Kanthak ; Dietrich Klakow

In this paper, a method is proposed which embeds regular grammars into an N-gram Markov language model. This allows accurate speech recognition even for N-gram models estimated on sparse grammatical word sequences. Moreover, it allows explicit user-dependent modelling of word sequences, such as phone numbers, email addresses or US ZIP codes, separately from the Markov model. The method is described theoretically, along with an overview of a feasible implementation. More precisely, a language model preprocessing step generalizes the enclosed grammatical word sequences during language model learning. These grammars are embedded during speech decoding by using a novel transducer nesting technique. The Wall Street Journal corpus was used to evaluate the proposed method, on which we achieved a word error rate reduction of 31.1%. The computational environment used is typical of car head units and mobile devices.

#19 A method for structure estimation of weighted finite-state transducers and its application to grapheme-to-phoneme conversion

Authors: Yotaro Kubo ; Takaaki Hori ; Atsushi Nakamura

Weighted finite-state transducers (WFSTs) are widely used as a fundamental data structure in several spoken language processing systems since they can provide a unified representation of many types of probabilistic models. Even though the use of accurate WFSTs is important in many spoken language systems, WFSTs are conventionally obtained by transforming probabilistic models that are not estimated in terms of WFST accuracy. Several recent techniques have enabled the direct optimization of weight parameters in WFSTs; however, these techniques do not optimize the structures of WFSTs directly. In this paper, with the goal of achieving a direct estimation of WFST structures from a dataset, we introduce a Bayesian method for structure inference. The proposed method employs the hierarchical Dirichlet process (HDP) as a prior process over the generative processes of arcs in the WFSTs. Thanks to the flexibility of the HDP, which enables the handling of countably infinite entities, the proposed method can generate a potentially infinite number of arcs in the WFSTs. The efficiency of the proposed method is verified by estimating WFSTs for grapheme-to-phoneme (G2P) conversion. We confirmed that the WFST obtained by the proposed method realizes a more compact representation of G2P conversion than conventional N-gram-based G2P models.

#20 Combining forward-based and backward-based decoders for improved speech recognition performance

Authors: Denis Jouvet ; Dominique Fohr

Combining the outputs of speech recognizers is a known way of increasing speech recognition performance, and the ROVER approach handles such combinations efficiently. In this paper we show that the best performance is not achieved by combining the outputs of the best set of recognizers, but rather by combining outputs of recognizers that rely on different processing components, and in particular on a different order (backward vs. forward) for processing speech frames. Indeed, much better speech recognition results were obtained by combining outputs of sphinx-based recognizers with outputs of Julius-based recognizers than by combining the same number of outputs from only sphinx-based recognizers, even though the individual sphinx-based systems gave better results than the individual Julius-based recognizers. Further experiments have also been conducted using sphinx-based tools to process speech frames in reverse order (i.e. backward in time). The results clearly show that combining forward-based and backward-based decoders provides a significant improvement over a combination of forward-only or backward-only decoders. Experiments were conducted on the ESTER2 and ETAPE speech corpora. Overall, combining sphinx-based and Julius-based systems led to an 18.6% word error rate on ESTER2 test data, and a 24.5% word error rate on ETAPE test data.
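
For readers unfamiliar with ROVER, the voting stage reduces to majority selection over the slots of a word transition network once hypotheses are aligned; the toy sketch below assumes the alignment is already given (real ROVER builds it incrementally and can also weight votes by confidence scores).

```python
# Toy ROVER-style voting over pre-aligned hypotheses.
from collections import Counter

def rover_vote(aligned_hyps):
    """aligned_hyps: equal-length word lists; '' marks a deletion slot."""
    out = []
    for slot in zip(*aligned_hyps):
        word, _ = Counter(slot).most_common(1)[0]
        if word:                       # skip slots where deletion wins
            out.append(word)
    return out

hyps = [["the", "cat", "sat"], ["the", "cat", "at"], ["a", "cat", "sat"]]
print(rover_vote(hyps))                # ['the', 'cat', 'sat']
```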

#21 iVector-based acoustic data selection

Authors: Olivier Siohan ; Michiel Bacchiani

This paper presents a data selection approach where spoken utterances are selected in a sequential fashion from a large out-of-domain data set to match the utterance distribution of an in-domain data set. We propose to represent each utterance by its iVector, a low dimensional vector indicating the coordinate of that utterance in a subspace acoustic model. We show that the distribution of iVectors can characterize a data set and enables distinguishing subsets of utterances from different domains. Last, we present experimental speech recognition results based on a system trained on a data set constructed by the proposed algorithm and a comparison with random data selection.
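
A hedged sketch of the idea: fit simple Gaussian statistics to the in-domain iVectors, then greedily add the out-of-domain utterance whose inclusion keeps the selected set's statistics closest to the in-domain ones. The diagonal-Gaussian KL criterion below is a stand-in of our own, not necessarily the paper's exact sequential selection objective.

```python
# Greedy distribution-matching selection over utterance iVectors
# (diagonal-Gaussian KL as an illustrative matching criterion).
import numpy as np

def diag_gauss_kl(mu0, var0, mu1, var1):
    return 0.5 * np.sum(np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def select_utterances(out_ivecs, in_ivecs, n_select):
    mu_in, var_in = in_ivecs.mean(0), in_ivecs.var(0) + 1e-6
    chosen, pool = [], set(range(len(out_ivecs)))
    for _ in range(n_select):
        def kl_if_added(i):
            s = out_ivecs[chosen + [i]]
            return diag_gauss_kl(s.mean(0), s.var(0) + 1e-6, mu_in, var_in)
        best = min(pool, key=kl_if_added)      # utterance that best matches
        chosen.append(best)
        pool.remove(best)
    return chosen                              # indices of selected utterances
```

This brute-force variant rescans the pool at every step; a practical implementation would update the running mean/variance statistics incrementally.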

#22 Accurate and compact large vocabulary speech recognition on mobile devices

Authors: Xin Lei ; Andrew Senior ; Alexander Gruenstein ; Jeffrey Sorensen

In this paper we describe the development of an accurate, small-footprint, large vocabulary speech recognizer for mobile devices. To achieve the best recognition accuracy, state-of-the-art deep neural networks (DNNs) are adopted as acoustic models. A variety of speedup techniques for DNN score computation are used to enable real-time operation on mobile devices. To reduce memory and disk usage, on-the-fly language model (LM) rescoring is performed with a compressed n-gram LM. We were able to build an accurate and compact system that runs well under real time on a Nexus 4 Android phone.

#23 Pre-initialized composition for large-vocabulary speech recognition

Authors: Cyril Allauzen ; Michael Riley

This paper describes a modified composition algorithm that is used for combining two finite-state transducers, representing the context-dependent lexicon and the language model respectively, in large vocabulary speech recognition. This algorithm is a hybrid between the static and dynamic expansion of the resultant transducer, which maps from context-dependent phones to words and is searched during decoding. The approach is to pre-compute part of the recognition transducer and leave the balance to be expanded during decoding. This method allows for a fine-grained trade-off between space and time in recognition. For example, the time overhead of purely dynamic expansion can be reduced by over six-fold with only a 20% increase in memory in a collection of large-vocabulary recognition tasks available on the Google Android platform.

#24 Speaker dependent activation keyword detector based on GMM-UBM

Authors: Evelyn Kurniawati ; Sapna George

In this paper, we present a new method for isolated keyword detection that is meant to activate a personal device from a standby state. Instead of using common speech recognition methods such as Hidden Markov Models (HMMs) or Dynamic Time Warping (DTW), we modify a GMM-UBM (Gaussian Mixture Model - Universal Background Model) scheme better known in the speaker recognition field. Since only one adapted Gaussian mixture is used to represent the keyword, a second layer of checking is employed to ensure the right sequence of occurrence within the keyword: the sequence of highest-performing GMM components is compared, via the Longest Common Subsequence (LCS), with the one obtained during the registration phase. Results on a subset of the SpeechDat-Car database are presented to validate the benefit of this modeling under moderate noise levels.
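
The sequence check is a classical longest-common-subsequence computation; below is a sketch of how it might gate detection (the threshold and the component sequences are invented for illustration).

```python
# Classic dynamic-programming LCS over per-frame best-component indices.
def lcs_length(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

registered = [3, 3, 7, 7, 1, 4, 4, 9]   # top GMM components at registration
observed = [3, 7, 7, 2, 1, 4, 9, 9]     # top components for a test utterance
accept = lcs_length(registered, observed) / len(registered) >= 0.6
print(accept)
```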

#25 Written-domain language modeling for automatic speech recognition

Authors: Haşim Sak ; Yun-hsuan Sung ; Françoise Beaufays ; Cyril Allauzen

Language modeling for automatic speech recognition (ASR) systems has traditionally been in the verbal domain. In this paper, we present finite-state modeling techniques that we developed for language modeling in the written domain. The first technique we describe is for the verbalization of written-domain vocabulary items, which include lexical and non-lexical entities. The second technique is a decomposition-recomposition approach that addresses the out-of-vocabulary (OOV) and data sparsity problems with non-lexical entities such as URLs, email addresses, phone numbers, and dollar amounts. We evaluate the proposed written-domain language modeling approaches on a very large vocabulary speech recognition system for English. We show that written-domain language modeling improves speech recognition accuracy and the rendering of ASR transcripts in the written domain over a baseline system that uses a verbal-domain language model. In addition, the written-domain system is much simpler, since it does not require the complex and error-prone text normalization and denormalization rules generally required for verbal-domain language modeling.